Abstract

When faced with learning a set of inter-related tasks from a limited amount of usable data, learning 
each task independently may lead to poor generalization performance. Multi-Task Learning (MTL) 
exploits the latent relations between tasks and overcomes data scarcity limitations by co-learning all these 
tasks simultaneously to offer improved performance. We propose a novel Multi-Task Multiple Kernel 
Learning framework based on Support Vector Machines for binary classification tasks. By considering 
pair-wise task affinity in terms of similarity between a pair’s respective feature spaces, the new framework, 
compared to other similar MTL approaches, offers a high degree of flexibility in determining how similar 
feature spaces should be, as well as which pairs of tasks should share a common feature space in order to 
benefit overall performance. The associated optimization problem is solved via a block coordinate descent, 
which employs a consensus-form Alternating Direction Method of Multipliers algorithm to optimize the 
Multiple Kernel Learning weights and, hence, to determine task affinities. Empirical evaluation on seven 
data sets exhibits a statistically significant improvement of our framework’s results compared to the ones 
of several other Clustered Multi-Task Learning methods. 

1 Introduction

Multi-Task Learning (MTL) is a machine learning paradigm in which several related tasks are learned
simultaneously, with the hope that, by sharing information among tasks, the generalization performance of each
task will be improved. The underlying assumption behind this paradigm is that the tasks are related to
each other. Thus, it is crucial to decide how task relatedness is captured and incorporated into an MTL framework.
Although many different MTL methods [7, 12, 18, 15, 28, 1] have been proposed, which differ in how the
relatedness across multiple tasks is modeled, they all utilize a parameter or structure sharing strategy to
capture the task relatedness.

However, the previous methods are restricted in the sense that they assume all tasks are similarly related 
to each other and can equally contribute to the joint learning process. This assumption can be violated in 
many practical applications as “outlier” tasks often exist. In this case, the effect of “negative transfer”, i.e., 
sharing information between irrelevant tasks, can lead to a degraded generalization performance. 

To address this issue, several methods, along different directions, have been proposed to discover the 
inherent relationship among tasks. For example, some methods [3, 27, 28, 29] use a regularized probabilistic
setting, where sharing among tasks is done based on a common prior. These approaches are usually computationally
expensive. Another family of approaches, known as Clustered Multi-Task Learning (CMTL),
assumes that tasks can be clustered into groups such that the tasks within each group are close to each other 
according to a notion of similarity. Based on the current literature, clustering strategies can be broadly 
classified into two categories: task-level CMTL and feature-level CMTL. 

The first one, task-level CMTL, assumes that the model parameters used by all tasks within a group are
close to each other. For example, in [2, 13, 17], the weight vectors of the tasks belonging to the same group 
are assumed to be similar to each other. However, the major limitations for these methods are: (i) that such 
an assumption might be too risky, as similarity among models does not imply that meaningful sharing of 
information can occur between tasks, and (ii) for these methods, the group structure (number of groups or 
basis tasks) is required to be known a priori. 

The other strategy for task clustering, referred to as feature-level CMTL, is based on the assumption 
that task relatedness can be modeled as learning shared features among the tasks within each group. For 
example, in [19] the tasks are clustered into different groups and it is assumed that tasks within the same 
group can jointly learn a shared feature representation. The resulting formulation leads to a non-convex 
objective, which is optimized using an alternating optimization algorithm converging to local optima, and 
suffers potentially from slow convergence. Another similar approach has been proposed in [26], which assumes 
that tasks should be related in terms of feature subsets. This study also leads to a non-convex co-clustering 
structure that captures task-feature relationship. These methods are restricted in the sense that they assume 
that tasks from different groups have nothing in common with each other. However, this assumption is not 
always realistic, as tasks in disjoint groups might still be inter-related, albeit weakly. Hence, assigning tasks
into different groups may not take full advantage of MTL. Another feature-level clustering model has been
proposed in [30], in which the cluster structure can vary from feature to feature. While this model is more
flexible compared to other CMTL methods, it is also more complicated and less general compared
to our framework, as it tries to find a shared feature representation for tasks by decomposing each task 
parameter into two parts: one to capture the shared structure between tasks and another to capture the 
variations specific to each task. This model is further extended in [16], where a multi-level structure has 
been introduced to learn task groups in the context of MTL. Interestingly, it has been shown that there 
is an equivalent relationship between CMTL and alternating structure optimization [31], wherein the basic 
idea is to identify a shared low-dimensional predictive structure for all tasks. 

In this paper, we develop a new MTL model capable of modeling a more general type of task relationship, 
where the tasks are implicitly grouped according to a notion of feature similarity. In our framework, the tasks 
are not forced to have a common feature space; instead, the data automatically suggests a flexible group 
structure, in which common, similar or even distinct feature spaces can be determined between different
pairs of tasks. Additionally, our MTL framework is kernel-based and, thus, may take advantage of the non-linearity
introduced by the feature mapping of the associated Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$.
Also, to avoid a degradation in generalization performance due to choosing an inappropriate kernel function, 
our framework employs a Multiple Kernel Learning (MKL) strategy [21], hence, rendering it a Multi-Task 
Multiple Kernel Learning (MT-MKL) approach. 

It is worth mentioning that a widely adopted practice for combining kernels is to place an $L_p$-norm
constraint on the combination coefficients $\theta = [\theta_1, \ldots, \theta_M]$, which are learned during training. For example,
a conic combination of task objectives with an $L_p$-norm feasible region is introduced in [23] and further
extended in [22]. Also, another method introduced in [25] proposes a partially shared kernel function
$k_t = \sum_{m=1}^{M} (\mu^m + \lambda_t^m) k_m$, along with $L_1$-norm constraints on $\mu$ and $\lambda$. The main advantage of such a
method over the traditional MT-MKL methods, which consider a common kernel function for all tasks (by
letting $\lambda_t^m = 0, \forall t, m$), is that it allows tasks to have their own task-specific feature spaces and, potentially,
alleviates the effect of negative transfer. However, popular MKL formulations in the context of MTL, such as
this one, are capable of modeling only two types of tasks: those that share a global, common feature space and
those that employ their own, task-specific feature space. In this work we propose a more flexible framework,
which, in addition to allowing some tasks to use their own specific feature spaces (to avoid negative
transfer), permits forming arbitrary groups of tasks sharing the same, group-specific (instead of a single,
global), common feature space, whenever warranted by the data. This is accomplished by considering a
group lasso regularizer applied to the set of all pair-wise differences of task-specific MKL weights. With
no regularization penalty, each task is learned independently of the others and utilizes its own feature
space. As the regularization penalty increases, pairs of MKL weights are forced to equal each other, leading
the corresponding pairs of tasks to share a common feature space. We demonstrate that the resulting
optimization problem can be solved by employing a 2-block coordinate descent approach, whose first block
consists of the Support Vector Machine (SVM) weights for each task and can be optimized efficiently
using existing solvers, while its second block comprises the MKL weights from all tasks and is optimized via
a consensus-form Alternating Direction Method of Multipliers (ADMM)-based step.

The rest of the paper is organized as follows: In Sect. 2 we describe our formulation for jointly learning 
the optimal feature spaces and the parameters of all the tasks. Sect. 3 provides an optimization technique 
to solve our non-smooth convex optimization problem derived in Sect. 2. Sect. 4 presents a Rademacher 
complexity-based generalization bound for the hypothesis space corresponding to our model. Experiments 
are provided in Sect. 5, which demonstrate the effectiveness of our proposed model compared to several MTL 
methods. Finally, in Sect. 6 we conclude our work and briefly summarize our findings. 

Notation: In what follows, we use the following notational conventions: vectors and matrices are depicted
in bold face. A prime ' denotes vector/matrix transposition. The ordering symbols $\succeq$ and $\preceq$, when applied
to vectors, stand for the corresponding component-wise relations. If $\mathbb{Z}_+$ is the set of positive integers, for a
given $S \in \mathbb{Z}_+$, we define $\mathbb{N}_S = \{1, \ldots, S\}$. Additional notation is defined in the text as needed.


2 Formulation 


Assume $T$ supervised learning tasks, each with a training set $\{(x_t^i, y_t^i)\}_{i=1}^{n_t}$, $t \in \mathbb{N}_T$, which is sampled from
an unknown distribution $P_t(x, y)$ on $\mathcal{X} \times \{-1, 1\}$. Here, $\mathcal{X}$ denotes the native space of samples for all tasks
and $\pm 1$ are the associated labels. Without loss of generality, we will assume an equal number $n$ of training
samples per task. The objective is to learn $T$ binary classification tasks using discriminative functions
$f_t(x) = \langle w_t, \phi_t(x) \rangle_{\mathcal{H}_{t,\theta}} + b_t$ for $t \in \mathbb{N}_T$, where $w_t$ is the weight vector associated to task $t$. Moreover, the
feature space of task $t$ is served by $\mathcal{H}_{t,\theta} = \bigoplus_{m=1}^{M} \sqrt{\theta_t^m}\, \mathcal{H}_m$ with induced feature mapping
$\phi_t = [\sqrt{\theta_t^1}\, \phi_1', \ldots, \sqrt{\theta_t^M}\, \phi_M']'$ and endowed with the inner product
$\langle \cdot, \cdot \rangle_{\mathcal{H}_{t,\theta}} = \sum_{m=1}^{M} \theta_t^m \langle \cdot, \cdot \rangle_{\mathcal{H}_m}$. The reproducing kernel
function for this feature space is given as $k_t(x_t^i, x_t^j) = \sum_{m=1}^{M} \theta_t^m k_m(x_t^i, x_t^j)$ for all $x_t^i, x_t^j \in \mathcal{X}$. In our
framework, we attempt to learn the $w_t$'s and $b_t$'s jointly with the $\theta_t$'s via the following regularized risk
minimization problem:


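$$\min_{w \in \Omega(w),\, \theta \in \Omega(\theta),\, b} \; \sum_{t=1}^{T} \frac{\|w_t\|^2}{2} + C \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ 1 - y_t^i f_t(x_t^i) \right]_+ + \lambda \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2$$

$$\Omega(w) = \{ w = (w_1, \ldots, w_T) : w_t \in \mathcal{H}_{t,\theta},\ \theta \in \Omega(\theta) \}$$

$$\Omega(\theta) = \{ \theta = (\theta_1, \ldots, \theta_T) : \theta_t \succeq 0,\ \|\theta_t\|_1 \le 1,\ \forall t \in \mathbb{N}_T \} \qquad (1)$$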


where $w = (w_1, \ldots, w_T)$ and $\theta = (\theta_1, \ldots, \theta_T)$, $\Omega(w)$ and $\Omega(\theta)$ are the corresponding feasible sets for
$w$ and $\theta$ respectively, and $[u]_+ = \max\{u, 0\}$, $u \in \mathbb{R}$, denotes the hinge function. Finally, $C$ and $\lambda$ are
non-negative regularization parameters.

The last term in Problem 1 is the sum of pairwise differences between the tasks’ feature weight vectors.
For each pair $(\theta_t, \theta_s)$, the pairwise penalty $\|\theta_t - \theta_s\|_2$ may favor a small number of non-identical $\theta_t$.
Therefore, it ensures that a flexible (common, similar or distinct) feature space will be selected between
tasks $t$ and $s$. In this manner, a flexible group structure of shared features across multiple tasks can be
achieved by this framework. It is also worth mentioning that two special cases are covered by the proposed
model: (i) if $\lambda \to \infty$ ($\lambda$ is only required to be sufficiently large), $\|\theta_t - \theta_s\|_2 \to 0$ for all task pairs and, thus,
all tasks share a single common feature space; (ii) as $\lambda \to 0$, the proposed model reduces to $T$ independent
classification tasks.
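
As a rough illustration of how the pairwise penalty behaves, the following Python sketch evaluates the last term of Problem 1 for hand-picked MKL weights. The function name `pairwise_penalty` and the example weight matrices are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def pairwise_penalty(theta):
    """Sum of ||theta_t - theta_s||_2 over all task pairs t < s.

    theta: array of shape (T, M), one row of MKL weights per task.
    """
    T = theta.shape[0]
    total = 0.0
    for t in range(T - 1):
        for s in range(t + 1, T):
            total += np.linalg.norm(theta[t] - theta[s])
    return total

# Three tasks that all use the same kernel weights (one common feature space).
shared = np.tile([0.5, 0.3, 0.2], (3, 1))

# Tasks 0 and 1 form a group; task 2 keeps its own feature space.
grouped = np.array([[0.5, 0.5, 0.0],
                    [0.5, 0.5, 0.0],
                    [0.0, 0.0, 1.0]])

penalty_shared = pairwise_penalty(shared)
penalty_grouped = pairwise_penalty(grouped)
```

The penalty is zero when all tasks share the same weights, and in the grouped case only the pairs that straddle the two groups contribute.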

It is easy to verify that Problem 1 is a convex minimization problem, which can be solved using a block
coordinate descent method alternating between the minimization with respect to $\theta$ and the $(w, b)$ pair.
Motivated by the non-smooth nature of the last regularization term, in Sect. 3 we develop a consensus
version of the ADMM to solve the minimization problem with respect to $\theta$.


3 The proposed Consensus Optimization Algorithm 

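Problem 1 can be formulated as the following equivalent problem, which entails $T$ inter-related SVM training
problems:

$$\min_{\theta, w, b, \xi} \; \sum_{t=1}^{T} \sum_{m=1}^{M} \frac{\|w_t^m\|_{\mathcal{H}_m}^2}{2 \theta_t^m} + C \sum_{t=1}^{T} \sum_{i=1}^{n} \xi_t^i + \lambda \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2$$

$$\text{s.t.} \quad y_t^i \left( \langle w_t, \phi_t(x_t^i) \rangle_{\mathcal{H}_{t,\theta}} + b_t \right) \ge 1 - \xi_t^i, \quad \xi_t^i \ge 0, \quad \forall t \in \mathbb{N}_T,\ i \in \mathbb{N}_n$$

$$\theta_t \succeq 0, \quad \|\theta_t\|_1 \le 1, \quad \forall t \in \mathbb{N}_T \qquad (2)$$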

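It can be shown that the primal-dual form of Problem 2 with respect to $\theta$ and $\{w, b, \xi\}$ is given by

$$\min_{\theta \in \Omega(\theta)} \max_{\alpha \in \Omega(\alpha)} \; \sum_{t=1}^{T} \alpha_t' 1_n - \frac{1}{2} \sum_{t=1}^{T} \sum_{m=1}^{M} \theta_t^m \left( \alpha_t' Y_t K_t^m Y_t \alpha_t \right) + \lambda \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2$$

$$\Omega(\alpha) = \{ \alpha = (\alpha_1, \ldots, \alpha_T) : 0 \preceq \alpha_t \preceq C 1_n,\ \alpha_t' y_t = 0,\ \forall t \in \mathbb{N}_T \}$$

$$\Omega(\theta) = \{ \theta = (\theta_1, \ldots, \theta_T) : \theta_t \succeq 0,\ \|\theta_t\|_1 \le 1,\ \forall t \in \mathbb{N}_T \} \qquad (3)$$

where $1_n$ is a vector containing $n$ 1's, $Y_t = \mathrm{diag}(y_t)$, $K_t^m \in \mathbb{R}^{n \times n}$ is the kernel matrix, whose $(i, j)$ entry is
given as $k_m(x_t^i, x_t^j)$, $\theta_t = [\theta_t^1, \ldots, \theta_t^M]'$, and $\alpha_t$ is the Lagrangian dual variable for the minimization problem
w.r.t. $\{w_t, b_t, \xi_t\}$.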

It is not hard to verify that the optimal objective value of the dual problem is equal to the optimal
objective value of the primal one, as strong duality holds for the primal-dual optimization problems
w.r.t. $\{w, b, \xi\}$ and $\alpha$ respectively. Therefore, a block coordinate descent framework¹ can be applied to
decompose Problem 3 into two subproblems. The first subproblem, which is the maximization problem with
respect to $\alpha$, can be efficiently solved via LIBSVM [8], and the second subproblem, which is the minimization
problem with respect to $\theta$, takes the form

$$\min_{\theta} \; \lambda \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2 + \sum_{t=1}^{T} \theta_t' q_t$$

$$\text{s.t.} \quad \theta_t \succeq 0, \quad \|\theta_t\|_1 \le 1, \quad \forall t \in \mathbb{N}_T \qquad (4)$$

where we defined $q_t^m = -\frac{1}{2} \alpha_t' Y_t K_t^m Y_t \alpha_t$ and $q_t = [q_t^1, \ldots, q_t^M]'$. Due to the non-smooth nature of Problem 4,
we derive a consensus ADMM-based optimization algorithm to solve it efficiently. Based on the exposition
provided in Sections 5 and 7 of [6], it is straightforward to verify that Problem 4 can be written in ADMM
form as

$$\min_{s, \theta, z} \; \lambda \sum_{i=1}^{N} h_i(s_i) + g(\theta) + I_{\Omega(\theta)}(z)$$

$$\text{s.t.} \quad s_i - \tilde{\theta}_i = 0, \quad i \in \mathbb{N}_N$$

$$z - \theta = 0 \qquad (5)$$

where $N = T(T-1)/2$, and the local variable $s_i \in \mathbb{R}^{2M}$ consists of two vector variables $(s_i)_j$ and $(s_i)_{j'}$, where
$(s_i)_j = \theta_{\mathcal{M}(i,j)}$. Note that the index mapping $t = \mathcal{M}(i, j)$ maps the $j$-th component of the local variable $s_i$ to
the $t$-th component of the global variable $\theta$. Also, $\tilde{\theta}_i$ can be considered as the global variable’s idea of what
the local variable $s_i$ should be. Moreover, for each $i$, the function $h_i(s_i)$ is defined as $\|(s_i)_j - (s_i)_{j'}\|_2$, and
the objective term $g(\theta)$ is given as $\sum_{t=1}^{T} \theta_t' q_t$. Finally, $I_{\Omega(\theta)}(z)$ is the indicator function for the constraint
set $\Omega(\theta)$ (i.e., $I_{\Omega(\theta)}(z) = 0$ for $z \in \Omega(\theta)$, and $I_{\Omega(\theta)}(z) = \infty$ for $z \notin \Omega(\theta)$).
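
To make the bookkeeping behind the local variables concrete, here is a small Python sketch of one possible way to enumerate the $N = T(T-1)/2$ task pairs and to look up which local variables hold a copy of a given task's weights. The helper names `task_pairs` and `pairs_touching` and the 0-based indexing are assumptions made for illustration only.

```python
from itertools import combinations

def task_pairs(T):
    """List the N = T(T-1)/2 pairs (t, s) with t < s, using 0-based task indices."""
    return list(combinations(range(T), 2))

def pairs_touching(task, pairs):
    """Indices i of the local variables s_i that contain a copy of theta_task."""
    return [i for i, (t, s) in enumerate(pairs) if task in (t, s)]

pairs = task_pairs(4)               # local variable s_i belongs to pairs[i]
copies = pairs_touching(2, pairs)   # every task appears in T - 1 pairs
```

Each task takes part in exactly $T - 1$ pairs, so its weight vector is replicated in $T - 1$ local variables.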

The augmented Lagrangian (using scaled dual variables) for Problem 5 is

$$L_\rho(s, \theta, z, u, v) = \lambda \sum_{i=1}^{N} h_i(s_i) + g(\theta) + I_{\Omega(\theta)}(z) + (\rho/2) \sum_{i=1}^{N} \|s_i - \tilde{\theta}_i + u_i\|_2^2 + (\rho/2) \|z - \theta + v\|_2^2, \qquad (6)$$

¹ A MATLAB® implementation of our framework is available at
https://github.com/niloofaryousefi/ECML2015
where $u_i$ and $v$ are the dual variables for the constraints $s_i = \tilde{\theta}_i$ and $z = \theta$ respectively. Applying ADMM
on the Lagrangian function given in (6), the following steps are carried out in the $k$-th iteration

$$s_i^{k+1} = \arg\min_{s_i} \{ \lambda h_i(s_i) + (\rho/2) \|s_i - \tilde{\theta}_i^k + u_i^k\|_2^2 \} \qquad (7)$$

$$\theta^{k+1} = \arg\min_{\theta} \{ g(\theta) + (\rho/2) \sum_{i=1}^{N} \|s_i^{k+1} - \tilde{\theta}_i + u_i^k\|_2^2 + (\rho/2) \|z^k - \theta + v^k\|_2^2 \} \qquad (8)$$

$$z^{k+1} = \arg\min_{z} \{ I_{\Omega(\theta)}(z) + (\rho/2) \|z - \theta^{k+1} + v^k\|_2^2 \} \qquad (9)$$

$$u_i^{k+1} = u_i^k + s_i^{k+1} - \tilde{\theta}_i^{k+1} \qquad (10)$$

$$v^{k+1} = v^k + z^{k+1} - \theta^{k+1} \qquad (11)$$

where, for each $i \in \mathbb{N}_N$, the $s$- and $u$-updates can be carried out independently and in parallel. It is also
worth mentioning that the $s$-update is a proximal operator evaluation for $\|\cdot\|_2$, which can be simplified to

$$s_i^{k+1} = S_{\lambda/\rho}(\tilde{\theta}_i^k - u_i^k), \qquad (12)$$

where $S_\kappa$ is the vector-valued soft thresholding (or shrinkage) operator, which is defined as

$$S_\kappa(a) = (1 - \kappa / \|a\|_2)_+ \, a, \qquad S_\kappa(0) = 0. \qquad (13)$$
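
The shrinkage operator in (13) is short enough to spell out in code. The following Python sketch is one straightforward way to write it; the function name `soft_threshold` is an assumption chosen for this example.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Vector soft thresholding S_kappa(a) = (1 - kappa / ||a||_2)_+ a, with S_kappa(0) = 0."""
    a = np.asarray(a, dtype=float)
    norm = np.linalg.norm(a)
    if norm == 0.0:
        return np.zeros_like(a)
    # Shrink the whole vector toward the origin; vectors shorter than kappa collapse to zero.
    return max(1.0 - kappa / norm, 0.0) * a

shrunk = soft_threshold([3.0, 4.0], kappa=1.0)     # norm 5 is scaled by 0.8
vanished = soft_threshold([3.0, 4.0], kappa=10.0)  # kappa exceeds the norm
```

Unlike component-wise soft thresholding, the operator rescales the vector as a whole, which is what allows an entire difference vector to be set to zero at once.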


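Furthermore, as the objective term $g$ is separable in $\theta_t$, the $\theta$-update can be decomposed into $T$ independent
minimization problems, for which a closed-form solution exists

$$\theta_t^{k+1} = \frac{1}{T-1} \left[ \sum_{\mathcal{M}(i,j) = t} \left( (s_i)_j^{k+1} + (u_i)_j^k \right) + (z_t^k + v_t^k) - (1/\rho)\, q_t \right], \qquad \forall t \in \mathbb{N}_T \qquad (14)$$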

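Algorithm 1 Algorithm for solving Problem 3.

Input: $X_1, \ldots, X_T, Y_1, \ldots, Y_T, C, \lambda$
Output: $\theta_1, \ldots, \theta_T, \alpha_1, \ldots, \alpha_T$
1: Initialize: $\theta_1^{(0)}, \ldots, \theta_T^{(0)}$, $r = 1$
2: Calculate: Base kernel matrices $K_t^m$ using $X_t$'s for the $T$ tasks and the $M$ kernels.
3: while not converged do
4:   $\alpha^{(r)} \leftarrow \arg\max_{\alpha \in \Omega(\alpha)} \sum_{t=1}^{T} \alpha_t' 1_n - \frac{1}{2} \sum_{t=1}^{T} \sum_{m=1}^{M} (\theta_t^m)^{(r-1)} (\alpha_t' Y_t K_t^m Y_t \alpha_t)$
5:   $(q_t^m)^{(r)} \leftarrow -\frac{1}{2} (\alpha_t')^{(r)} Y_t K_t^m Y_t (\alpha_t)^{(r)}, \ \forall t, m$
6:   $\theta^{(r)} \leftarrow \arg\min_{\theta \in \Omega(\theta)} \lambda \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2 + \sum_{t=1}^{T} \theta_t' q_t^{(r)}$ using Algorithm 2
7: end while
8: $\alpha^* = \alpha^{(r)}$
9: $\theta^* = \theta^{(r)}$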
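
Step 5 of Algorithm 1 and the task-specific kernel can both be expressed as short array operations. The Python sketch below is a minimal illustration for a single task; the function names `q_vector` and `combined_kernel` and the stacked `(M, n, n)` kernel layout are assumptions made for this example, not the layout of the authors' MATLAB code.

```python
import numpy as np

def q_vector(alpha, y, kernels):
    """q^m = -1/2 * alpha' Y K^m Y alpha for one task and every base kernel m.

    alpha: (n,) dual variables; y: (n,) labels in {-1, +1};
    kernels: (M, n, n) stack of base kernel matrices for the task.
    """
    v = alpha * y  # equals Y alpha, because Y = diag(y)
    return -0.5 * np.einsum("i,mij,j->m", v, kernels, v)

def combined_kernel(theta, kernels):
    """Task kernel matrix sum_m theta^m K^m."""
    return np.tensordot(theta, kernels, axes=1)

# Toy example with two base kernels on two samples.
kernels = np.stack([np.eye(2), np.ones((2, 2))])
alpha = np.array([1.0, 2.0])
y = np.array([1.0, -1.0])

q = q_vector(alpha, y, kernels)
K_task = combined_kernel(np.array([0.5, 0.5]), kernels)
```

Because $q_t$ only depends on the current $\alpha_t$, it can be recomputed per task right after each SVM solve and then passed to the $\theta$-update.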


In the third step of the ADMM, we project $(\theta^{k+1} - v^k)$ onto the constraint set $\Omega(\theta)$. Note that this
set is separable in $\theta$, so the projection step can also be performed independently and in parallel for each
variable $z_t$, i.e.,

$$z_t^{k+1} = \Pi_{\Omega(\theta)}(\theta_t^{k+1} - v_t^k), \qquad \forall t \in \mathbb{N}_T. \qquad (15)$$

The $z_t$-update can also be seen as the problem of finding the intersection between two closed convex sets
$\Omega_1(\theta) = \{\theta_t \succeq 0, \forall t \in \mathbb{N}_T\}$ and $\Omega_2(\theta) = \{\|\theta_t\|_1 \le 1, \forall t \in \mathbb{N}_T\}$, which can be handled using Dykstra’s
alternating projections method [5, 11] as follows

$$y_t^{k+1} = \Pi_{\Omega_1(\theta)}(\theta_t^{k+1} - v_t^k - \beta_t^k) = \left[ \theta_t^{k+1} - v_t^k - \beta_t^k \right]_+, \qquad \forall t \in \mathbb{N}_T \qquad (16)$$

$$z_t^{k+1} = \Pi_{\Omega_2(\theta)}(y_t^{k+1} + \beta_t^k) = P_M (y_t^{k+1} + \beta_t^k) + \frac{1}{M} 1_M, \qquad \forall t \in \mathbb{N}_T \qquad (17)$$

$$\beta_t^{k+1} = \beta_t^k + y_t^{k+1} - z_t^{k+1}, \qquad \forall t \in \mathbb{N}_T \qquad (18)$$

where $P_M = I_M - \frac{1}{M} 1_M 1_M'$ is the centering matrix. Furthermore, the $y_t$- and $z_t$-updates are the Euclidean
projections onto $\Omega_1(\theta)$ and $\Omega_2(\theta)$ respectively, with dual variables $\beta_t \in \mathbb{R}^{M \times 1}$, $t = 1, \ldots, T$. Finally, we
update the dual variables $u_i$ and $v$ using the equations given in (10) and (11).
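
The three updates (16)-(18) can be tried out in isolation on a single task. The Python sketch below applies them repeatedly to one fixed input vector, called `point` here, which stands in for the vector being projected. The function name `projection_pass`, the fixed input and the number of passes are assumptions made purely for illustration; inside the ADMM loop the input changes from one iteration to the next. Note that the $z$-step follows the closed form of (17) as written, which places $z$ on the hyperplane $1_M' z = 1$.

```python
import numpy as np

def projection_pass(point, beta):
    """One pass of the y-, z- and beta-updates (16)-(18) for a single task."""
    M = point.shape[0]
    y = np.maximum(point - beta, 0.0)   # (16): clip negative entries
    w = y + beta
    z = w - w.mean() + 1.0 / M          # (17): P_M w + (1/M) 1_M
    beta = beta + y - z                 # (18): accumulate the disagreement
    return y, z, beta

point = np.array([0.9, 0.5, -0.3])
beta = np.zeros(3)
for _ in range(200):
    y, z, beta = projection_pass(point, beta)
# For this input, y and z agree at approximately [0.7, 0.3, 0.0].
```

In this toy run the two iterates coincide in the limit, giving a non-negative vector whose entries sum to one.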


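Algorithm 2 Consensus ADMM algorithm to solve optimization Problem 4

Input: $q_1^{(r)}, \ldots, q_T^{(r)}, \rho$
Output: $\theta_1^{(r)}, \ldots, \theta_T^{(r)}$
1: Initialize: $\theta^{(0)}$, $k = 0$
2: while not converged do
3:   for $i \in \mathbb{N}_N$, $t \in \mathbb{N}_T$ do
4:     $s_i^{k+1} \leftarrow S_{\lambda/\rho}(\tilde{\theta}_i^k - u_i^k)$
5:     $\theta_t^{k+1} \leftarrow \frac{1}{T-1} \left[ \sum_{\mathcal{M}(i,j)=t} \left( (s_i)_j^{k+1} + (u_i)_j^k \right) + (z_t^k + v_t^k) - (1/\rho)\, q_t \right]$
6:     $y_t^{k+1} \leftarrow \left[ \theta_t^{k+1} - v_t^k - \beta_t^k \right]_+$
7:     $z_t^{k+1} \leftarrow P_M (y_t^{k+1} + \beta_t^k) + \frac{1}{M} 1_M$
8:     $\beta_t^{k+1} \leftarrow \beta_t^k + y_t^{k+1} - z_t^{k+1}$
9:     $u_i^{k+1} \leftarrow u_i^k + s_i^{k+1} - \tilde{\theta}_i^{k+1}$
10:    $v_t^{k+1} \leftarrow v_t^k + z_t^{k+1} - \theta_t^{k+1}$
11:  end for
12: end while
13: $\theta^{(r)} \leftarrow \theta^{(k+1)}$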

3.1 Convergence Analysis and Stopping Criteria 

Convergence of Algorithm 2 can be derived based on two mild assumptions similar to the standard convergence
theory of the ADMM method discussed in [6]: (i) the objective functions $h(s) = \sum_{i=1}^{N} \|(s_i)_j - (s_i)_{j'}\|_2$
and $g(\theta) = \sum_{t=1}^{T} \theta_t' q_t$ are closed, proper and convex, which implies that the subproblems arising in the $s$-update
(7) and $\theta$-update (8) are solvable, and (ii) the augmented Lagrangian (6) for $\rho = 0$ has a saddle point.
Under these two assumptions, it can be shown that our ADMM-based algorithm satisfies the following:

• Convergence of residuals: $s_i^k - \tilde{\theta}_i^k \to 0$, $\forall i \in \mathbb{N}_N$, and $z^k - \theta^k \to 0$ as $k \to \infty$.

• Convergence of dual variables: $u_i^k \to u_i^*$, $\forall i \in \mathbb{N}_N$, and $v^k \to v^*$ as $k \to \infty$, where $u_i^*$ and $v^*$ are the
dual optimal points.

• Convergence of the objective: $h(s^k) + g(z^k) \to p^*$ as $k \to \infty$, which means the objective function (4)
converges to its optimal value as the algorithm proceeds.

Also, the algorithm is terminated when the primal and dual residuals satisfy the following stopping
criteria

$$\|e_{p_j}^k\|_2 \le \epsilon^{pri}, \qquad \|e_{d_j}^k\|_2 \le \epsilon^{dual}, \qquad j = 1, 2, 3 \qquad (19)$$

where the primal residuals of the $k$-th iteration are given as $e_{p_1}^k = s^k - \tilde{\theta}^k$, $e_{p_2}^k = z^k - \theta^k$ and $e_{p_3}^k = y^k - z^k$.
Similarly, $e_{d_1}^k = \rho(\theta^{k+1} - \theta^k)$, $e_{d_2}^k = \rho(z^k - z^{k+1})$ and $e_{d_3}^k = \rho(y^k - y^{k+1})$ are the dual residuals at iteration
$k$. Also, the tolerances $\epsilon^{pri} > 0$ and $\epsilon^{dual} > 0$ can be chosen appropriately using the method described in
Chapter 3 of [6].
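
A termination test of this kind reduces to a few norm comparisons. The Python sketch below shows one way it could be written; the function name `should_stop` and the way the residuals are passed in as lists of arrays are assumptions for illustration, and the choice of tolerances is left to the caller.

```python
import numpy as np

def should_stop(primal_residuals, dual_residuals, eps_pri, eps_dual):
    """Return True when every primal and dual residual is within its tolerance."""
    primal_ok = all(np.linalg.norm(r) <= eps_pri for r in primal_residuals)
    dual_ok = all(np.linalg.norm(r) <= eps_dual for r in dual_residuals)
    return primal_ok and dual_ok

small = [np.array([1e-6, 0.0]), np.array([0.0, 1e-6])]
large = [np.array([0.5, 0.5])]

stop_now = should_stop(small, small, eps_pri=1e-4, eps_dual=1e-4)
keep_going = should_stop(small + large, small, eps_pri=1e-4, eps_dual=1e-4)
```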

3.2 Computational Complexity

Algorithm 1 needs to compute and cache $TM$ kernel matrices; however, they are computed only once in
$O(TMn^2)$ time. Also, as long as the number of tasks $T$ is not excessive, all the matrices can be computed
and stored on a single machine, since (i) the number $M$ of kernels is typically chosen small (e.g., we chose
$M = 10$), and (ii) the number $n$ of training samples per task is not usually large; if it were large, MTL
would probably not be able to offer any advantages over training each task independently. For each iteration
of Algorithm 1, $T$ independent SVM problems are solved at a time cost of $O(n^3)$ per task. Therefore, if
Algorithm 2 converges in $K$ iterations, the runtime complexity of Algorithm 1 becomes $O(Tn^3 + KMT^2)$
per iteration. Note, though, that $K$ is not usually more than a few tens of iterations [6].

On the other hand, if the number of tasks $T$ is large, the nature of our problem allows our algorithm
to be implemented in parallel. The $\alpha$-update can be handled as $T$ independent optimization problems,
which can be easily distributed to $T$ subsystems. Each subsystem needs to compute once and cache $M$
kernel matrices for each task. Then, for each iteration, one SVM problem is required to be solved by each
subsystem, which takes $O(n^3)$ time. Moreover, our ADMM-based algorithm updating the $\theta$ parameters
can also be implemented in parallel over $i \in \mathbb{N}_N$. Assuming that exchanging data and updates between
subsystems consumes negligible time, the ADMM only requires $O(KM)$ time. Therefore, taking advantage
of a distributed implementation, the complexity of Algorithm 1 is only $O(n^3 + KM)$ per iteration.


4 Generalization Bound based on Rademacher Complexity 

In this section, we provide a Rademacher complexity-based generalization bound for the Hypothesis Space
(HS) considered in Problem 1, which can be identified with the help of the following Proposition.²

Proposition 1. (Proposition 12 in [20], part (a)) Let $C \subseteq \mathcal{X}$ and let $f, g : C \mapsto \mathbb{R}$ be two functions. For any
$\nu > 0$, there must exist an $\eta > 0$, such that the optimal solution of (20) is also optimal in (21)

$$\min_{x \in C} f(x) + \nu g(x) \qquad (20)$$

$$\min_{x \in C,\, g(x) \le \eta} f(x) \qquad (21)$$

Using Proposition 1, one can show that Problem 1 is equivalent to the following problem

$$\min_{w \in \Omega(w)} \; C \sum_{t=1}^{T} \sum_{i=1}^{n} l\left( w_t, \phi_t(x_t^i), y_t^i \right)$$

$$\Omega(w) = \{ w = (w_1, \ldots, w_T) : w_t \in \mathcal{H}_{t,\theta},\ \theta \in \Omega'(\theta),\ \|w_t\|^2 \le R_t,\ t \in \mathbb{N}_T \} \qquad (22)$$

where

$$\Omega'(\theta) = \Omega(\theta) \cap \left\{ \theta = (\theta_1, \ldots, \theta_T) : \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2 \le \gamma \right\}$$

The goal here is to choose the $w$ and $\theta$ from their relevant feasible sets, such that the objective function
of (22) is minimized. Therefore, the relevant hypothesis space for Problem 22 becomes

$$\mathcal{F} = \left\{ x \mapsto [\langle w_1, \phi_1 \rangle, \ldots, \langle w_T, \phi_T \rangle]' : \forall t\ w_t \in \mathcal{H}_{t,\theta},\ \|w_t\|^2 \le R_t,\ \theta \in \Omega'(\theta) \right\} \qquad (23)$$

Note that finding the Empirical Rademacher Complexity (ERC) of $\mathcal{F}$ is complicated due to the non-smooth
nature of the constraint $\sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2 \le \gamma$. Instead, we will find the ERC of the HS $\mathcal{H}$
defined in (24); notice that $\mathcal{F} \subseteq \mathcal{H}$.

$$\mathcal{H} = \left\{ x \mapsto [\langle w_1, \phi_1 \rangle, \ldots, \langle w_T, \phi_T \rangle]' : \forall t\ w_t \in \mathcal{H}_{t,\theta},\ \|w_t\|^2 \le R_t,\ \theta \in \Omega''(\theta) \right\} \qquad (24)$$

where

$$\Omega''(\theta) \triangleq \Omega(\theta) \cap \left\{ \theta = (\theta_1, \ldots, \theta_T) : \sum_{t=1}^{T-1} \sum_{s>t} \|\theta_t - \theta_s\|_2^2 \le \gamma^2 \right\} \qquad (25)$$

² Note that Proposition 1 here utilizes the first part of Proposition 12 in [20] and does not require the strong
duality assumption, which is necessary for the second part of Proposition 12 in [20].

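Using the first part of Theorem (12) in [4], it can be shown that the ERC of $\mathcal{H}$ upper bounds the ERC
of function class $\mathcal{F}$. Thus, the bound derived for $\mathcal{H}$ is also valid for $\mathcal{F}$. The following theorem provides the
generalization bound for $\mathcal{H}$.

Theorem 1. Let $\mathcal{H}$ defined in (24) be the multi-task HS for a class of functions $f = (f_1, \ldots, f_T) : \mathcal{X} \mapsto \mathbb{R}^T$.
Then for all $f \in \mathcal{H}$, for $\delta > 0$ and for fixed $\rho > 0$, with probability at least $1 - \delta$ it holds that

$$R(f) \le \hat{R}_\rho(f) + \frac{[\ldots]}{\rho} \hat{\mathfrak{R}}_S(\mathcal{H}) + 3 \sqrt{[\ldots]} \qquad (26)$$

where

$$\hat{\mathfrak{R}}_S(\mathcal{H}) \le \hat{\mathfrak{R}}_{ub}(\mathcal{H}) = \sqrt{\frac{3 \gamma R M}{nT}} \qquad (27)$$

where $\hat{\mathfrak{R}}_S(\mathcal{H})$, the ERC of $\mathcal{H}$, is given as

$$\hat{\mathfrak{R}}_S(\mathcal{H}) = [\ldots]\ E_\sigma \left\{ \sup_{f = (f_1, \ldots, f_T) \in \mathcal{H}} \sum_{t=1}^{T} \sum_{i=1}^{n} \sigma_t^i f_t(x_t^i) \right\}, \qquad t \in \mathbb{N}_T,\ i \in \mathbb{N}_n \qquad (28)$$

the $\rho$-empirical large margin error $\hat{R}_\rho(f)$, for the training sample $S = \{(x_t^i, y_t^i)\}_{i,t=1}^{n,T}$, is defined as

$$\hat{R}_\rho(f) = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \min\left( 1, \left[ 1 - y_t^i f_t(x_t^i) / \rho \right]_+ \right)$$

Also, $R(f) = \Pr[y f(x) < 0]$ is the expected risk w.r.t. 0-1 loss, $n$ is the number of training samples for each
task, $T$ is the number of tasks to be trained, and $M$ is the number of kernel functions utilized for MKL.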

The proof of this theorem is omitted due to space constraints. Based on Theorem 1, the second term in
(26), the upper bound for the ERC of $\mathcal{H}$, decreases as the number of tasks increases. Therefore, it is reasonable
to expect the generalization performance to improve when the number $T$ of tasks or the number $n$
of training samples increases. Also, due to the formulation’s group lasso ($L_1/L_2$-norm) regularizer on the
pair-wise MKL weight differences, the ERC in (27) depends on $M$ as $O(\sqrt{M})$. It is worth mentioning
that, while this could be improved to $O(\sqrt{\log M})$ as in [9] if one considers instead an $L_p/L_q$-norm regularizer,
we do not pursue this avenue here. Let us finally note that (26) allows one to construct data-dependent
confidence intervals for the true, pooled (averaged over tasks) misclassification rate of the MTL problem
under consideration.


5 Experiments 

In this section, we demonstrate the merit of the proposed model via a series of comparative experiments. For 
reference, we consider two baseline methods referred to as STL and MTL, which present the two extreme 
cases discussed in Sect. 2. We also compare our method with five state-of-the-art methods which, like ours, 
fall under the CMTL family of approaches. These methods are briefly described below. 

• STL: single-task learning approach used as a baseline, according to which each task is individually 
trained via a traditional single-task MKL strategy. 

• MTL: a typical MTL approach, for which all tasks share a common feature space. An SVM-based 
formulation with multiple kernel functions was utilized and the common MKL parameters for all tasks 
were learned during training. 

• CMTL [17]: in this work, the tasks are grouped into disjoint clusters, such that the model parameters 
of the tasks belonging to the same group are close to each other. 

• Whom [19]: clusters the tasks into disjoint groups and assumes that tasks of the same group can
jointly learn a shared feature representation.

• FlexClus [30]: a flexible clustering structure of tasks is assumed, which can vary from feature to 
feature. 

• CoClus [26]: a co-clustering structure is assumed aiming to capture both the feature and task
relationship between tasks.

• MeTaG [16]: a multi-level grouping structure is constructed by decomposing the matrix of tasks’ 
parameters into a sum of components, each of which corresponds to one level and is regularized with
an $L_2$-norm on the pairwise difference between parameters of all the tasks.

5.1 Experimental Settings 

For all experiments, all kernel-based methods (including STL, MTL and our method) utilized 1 Linear,
1 Polynomial with degree 2, and 8 Gaussian kernels with spread parameters $\{2^0, \ldots, 2^7\}$ for MKL. All
kernel functions were normalized as $k(x, y) \leftarrow k(x, y) / \sqrt{k(x, x)\, k(y, y)}$. Moreover, for CMTL, Whom and
CoClus methods, which require the number of task clusters to be pre-specified, cross-validation over the
set $\{1, \ldots, T/2\}$ was used to select the optimal number of clusters. Also, the regularization parameters of
all methods were chosen via cross-validation over the set $\{2^{-10}, \ldots, 2^{10}\}$.
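
The kernel bank and the normalization step described above could be assembled along the following lines. This Python sketch makes several assumptions that the text does not specify: the polynomial kernel is taken as $(x'y + 1)^2$, the Gaussian kernel as $\exp(-\|x - y\|^2 / (2\sigma^2))$ with $\sigma$ as the spread, and no sample is the zero vector (otherwise the linear kernel cannot be normalized). The function names `normalize` and `base_kernels` are likewise chosen for this example.

```python
import numpy as np

def normalize(K):
    """Apply k(x, y) <- k(x, y) / sqrt(k(x, x) k(y, y)) to a kernel matrix."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def base_kernels(X, spreads=tuple(2.0 ** p for p in range(8))):
    """Return a (10, n, n) stack: linear, degree-2 polynomial and 8 Gaussian kernels."""
    gram = X @ X.T
    sq_norms = np.diag(gram)
    # Squared pairwise distances, clipped at zero to absorb rounding errors.
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram, 0.0)
    mats = [gram, (gram + 1.0) ** 2]
    mats += [np.exp(-sq_dists / (2.0 * s ** 2)) for s in spreads]
    return np.stack([normalize(K) for K in mats])

X = np.random.default_rng(0).normal(size=(5, 3))
K = base_kernels(X)   # shape (10, 5, 5), unit diagonal after normalization
```

After normalization every kernel matrix has a unit diagonal, which keeps the base kernels on a comparable scale before their weights are learned.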

5.2 Experimental Results 

We assess the performance of our proposed method compared to the other methods on 7 widely-used data 
sets including 3 real-world data sets: Wall-Following Robot Navigation (Robot), Statlog Vehicle Silhouettes
(Vehicle) and Statlog Image Segmentation (Image) from the UCI repository [14], 2 handwritten digit data
sets, namely MNIST Handwritten Digit (MNIST) and Pen-Based Recognition of Handwritten Digits (Pen),
as well as Letter and Landmine.

The data sets from the UCI repository correspond to three multi-class problems. In the Robot data set,
each sample is labeled as “Move-Forward”, “Slight-Right-Turn”, “Sharp-Right-Turn” or “Slight-Left-Turn”.
These classes are designed to navigate a robot through a room following the wall in a clockwise direction.
The Vehicle data set describes four different types of vehicles as “Opel”, “SAAB”, “Bus” and “Van”. On
the other hand, the instances of the Image data set were drawn randomly from a database of 7 outdoor
images which are labeled as “Sky”, “Foliage”, “Cement”, “Window”, “Path” and “Grass”.

Also, two multi-class handwritten digit data sets, namely MNIST and Pen, consist of samples of handwritten
digits from 0 to 9. Each example is labeled as one of ten classes. A one-versus-one strategy was
adopted to cast all multi-class learning problems into MTL problems, and the average classification accuracy
across tasks was calculated for each data set. Moreover, an equal number of samples from each class was
chosen for training for all five multi-class problems.
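
The one-versus-one conversion can be sketched in a few lines of Python. The function name `one_vs_one_tasks`, the dictionary layout and the convention that the first class of each pair receives the label +1 are assumptions made for this illustration. With ten digit classes the construction yields 10 · 9 / 2 = 45 binary tasks.

```python
from itertools import combinations
import numpy as np

def one_vs_one_tasks(X, labels):
    """Split a multi-class data set into one binary task per pair of classes."""
    tasks = {}
    for a, b in combinations(sorted(set(labels.tolist())), 2):
        mask = (labels == a) | (labels == b)
        # Class a is mapped to +1 and class b to -1 within the task.
        y = np.where(labels[mask] == a, 1, -1)
        tasks[(a, b)] = (X[mask], y)
    return tasks

X = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array([0, 0, 1, 1, 2, 2])
tasks = one_vs_one_tasks(X, labels)   # three classes give three binary tasks
```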

We also evaluate our method on two widely-used multi-task data sets, namely the Letter and Landmine
data sets. The former is a collection of handwritten words collected by Rob Kassel of MIT’s Spoken
Language Systems Group, and involves eight tasks: ‘C’ vs. ‘E’, ‘G’ vs. ‘Y’, ‘M’ vs. ‘N’, ‘A’ vs. ‘G’, ‘I’
vs. ‘J’, ‘A’ vs. ‘O’, ‘F’ vs. ‘T’ and ‘H’ vs. ‘N’. Each letter is represented by an 8 by 16 pixel image, which
forms a 128-dimensional feature vector per sample. We randomly chose 200 samples for each letter. An
exception is letter J, for which only 189 samples were available. The Landmine data set consists of 29 binary
classification tasks collected from various landmine fields. The objective is to recognize whether there is
a landmine or not based on a region’s characteristics, which are described by four moment-based features,
three correlation-based features, one energy ratio feature, and one spatial variance feature.

In all our experiments, for all methods, we considered training set sizes of 10%, 20% and 50% of the
original data set to investigate the influence of the data set size on generalization performance. An exception
was the Landmine data set, for which we used 20% and 50% of the data set for training purposes due to its
small size. The rest of the data was split into equal sizes for validation and testing.


Table 1: Experimental comparison between our method and seven benchmark methods

10%       STL (7)          MTL (5.42)       CMTL (6.33)      Whom (3.25)      FlexClus (4.33)  Coclus (4)       MeTaG (5)        Our Method (1.67)
Robot     84.51 (7)        84.82 (6)        84.15 (8)        88.90 (1)        88.34 (4)        87.83 (5)        88.77 (2)        88.67 (3)
Vehicle   79.73 (8)        80.38 (6)        80.23 (7)        83.14 (4)        82.45 (5)        86.79 (1)        83.53 (3)        84.51 (2)
Image     97.08 (7)        97.43 (3)        97.09 (6)        97.27 (4)        98.05 (2)        97.24 (5)        97.05 (8)        98.19 (1)
Pen       98.16 (7)        98.28 (5.5)      95.78 (8)        98.28 (5.5)      98.67 (3)        99.26 (1)        98.57 (4)        99.12 (2)
MNIST     94.09 (7)        94.87 (4)        94.49 (6)        95.56 (3)        94.59 (5)        93.09 (8)        96.13 (2)        96.70 (1)
Letter    84.12 (6)        83.12 (8)        85.62 (3)        86.82 (2)        83.72 (7)        85.46 (4)        85.41 (5)        87.41 (1)

20%       STL (6)          MTL (4.43)       CMTL (6.14)      Whom (3.29)      FlexClus (5.57)  Coclus (4.57)    MeTaG (4.71)     Our Method (1.14)
Robot     87.67 (7)        88.23 (6)        85.08 (8)        90.76 (1)        90.15 (3)        88.43 (5)        89.12 (4)        90.34 (2)
Vehicle   85.88 (4)        86.16 (3)        82.29 (8)        85.67 (6)        85.29 (7)        87.15 (2)        85.78 (5)        87.76 (1)
Image     97.41 (6)        98.02 (3)        97.32 (7)        98.46 (2)        97.44 (5)        97.50 (4)        97.29 (8)        98.54 (1)
Pen       98.57 (7)        99.01 (6)        96.06 (8)        99.14 (3)        99.13 (4)        99.30 (2)        99.02 (4)        99.63 (1)
MNIST     96.13 (6)        96.71 (4)        96.56 (5)        96.76 (3)        95.04 (7)        94.09 (8)        96.84 (2)        97.86 (1)
Landmine  58.76 (8)        61.89 (7)        65.28 (2)        62.53 (5)        62.46 (6)        63.52 (3)        62.59 (4)        65.82 (1)
Letter    88.75 (4)        89.98 (2)        88.24 (5)        88.88 (3)        83.79 (7)        82.26 (8)        87.99 (6)        90.72 (1)

50%       STL (5.64)       MTL (3.85)       CMTL (6.29)      Whom (3.29)      FlexClus (6.21)  Coclus (5.29)    MeTaG (4.42)     Our Method (1)
Robot     91.26 (5.5)      91.49 (3)        86.26 (8)        91.70 (2)        91.26 (5.5)      89.04 (7)        91.27 (4)        92.41 (1)
Vehicle   88.33 (3)        88.71 (2)        83.91 (8)        87.3 (5)         86.72 (7)        87.55 (4)        86.81 (6)        89.83 (1)
Image     98.40 (6)        98.43 (5)        97.56 (8)        98.58 (2)        98.04 (7)        98.52 (3)        98.49 (4)        99.07 (1)
Pen       98.77 (7)        99.23 (5)        96.17 (8)        99.32 (4)        99.33 (3)        99.34 (2)        99.21 (6)        99.77 (1)
MNIST     97.20 (6)        97.37 (4)        97.31 (5)        97.78 (3)        96.60 (7)        95.87 (8)        98.46 (2)        98.64 (1)
Landmine  63.76 (8)        64.98 (6)        66.76 (2)        65.57 (4)        64.87 (7)        65.15 (5)        66.24 (3)        67.15 (1)
Letter    91.18 (4)        91.62 (2)        90.97 (5)        91.25 (3)        86.47 (7)        86.27 (8)        90.66 (6)        92.49 (1)


In Table 1, we report the average classification accuracy over 20 runs of randomly sampled training sets
for each experiment. Note that we utilized the method proposed in [10] for our statistical analysis. More
specifically, Friedman’s and Holm’s post-hoc tests at significance level α = 0.05 were employed to compare
our proposed method with the other methods.

As shown in Table 1, for each data set, Friedman’s test ranks the best performing model as first, the
second best as second and so on. The rank shown next to each value in Table 1 is that of the
corresponding model on the relevant data set, while the number next to each model name is its average
rank over all data sets for the corresponding training set size. Note that methods depicted in boldface
are deemed statistically similar to our model, since their corresponding p-values are not smaller than the
adjusted α values obtained by Holm’s post-hoc test. Overall, it can be observed that our method dominates
three, six and five out of seven methods, when trained with 10%, 20% and 50% training set sizes respectively.
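
The comparison against a control method can be sketched in a few lines. The code below is an illustration rather than the authors' analysis script. It assumes the usual rank-based statistic z = (R_i - R_0) / sqrt(k(k+1)/(6N)) for k methods and N data sets, a two-sided normal p-value, and Holm's step-down thresholds α/(m - i). The function name `holm_vs_control` is hypothetical. The average ranks are those listed for the 20% training set size in Table 1.

```python
import math

def holm_vs_control(avg_ranks, control, n_datasets, alpha=0.05):
    """Compare every method against `control` using average Friedman ranks.

    Returns (method, z, p, adjusted_alpha, rejected) tuples ordered from the
    smallest to the largest p-value.
    """
    k = len(avg_ranks)                                 # number of methods
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))   # std. error of a rank difference
    stats = []
    for name, rank in avg_ranks.items():
        if name == control:
            continue
        z = (rank - avg_ranks[control]) / se
        p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal p-value
        stats.append((name, z, p))
    stats.sort(key=lambda item: item[2])
    m = len(stats)
    results, still_rejecting = [], True
    for i, (name, z, p) in enumerate(stats):
        adjusted = alpha / (m - i)                     # Holm step-down threshold
        # Once one hypothesis is retained, all later ones are retained too.
        still_rejecting = still_rejecting and p < adjusted
        results.append((name, z, p, adjusted, still_rejecting))
    return results

# Average ranks for the 20% training set size, copied from Table 1.
ranks_20 = {"STL": 6.0, "MTL": 4.43, "CMTL": 6.14, "Whom": 3.29,
            "FlexClus": 5.57, "Coclus": 4.57, "MeTaG": 4.71, "Ours": 1.14}
for name, z, p, adj, rej in holm_vs_control(ranks_20, "Ours", n_datasets=7):
    print(f"{name:9s} z={z:5.2f} p={p:.4f} adjusted_alpha={adj:.4f} rejected={rej}")
```

With these inputs, the computed test statistics agree with the 20% block of Table 2 up to rounding, and every method except Whom is rejected.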

Table 2: Comparison of our method against the other methods with the Holm test

10%             STL       MTL       CMTL      Whom      FlexClus  Coclus    MeTaG
Test statistic  3.93      2.13      3.49      1.25      2.40      2.62      2.29
p value         0.0005    0.0138    0.0022    0.2869    0.0777    0.1214    0.1214
Adjusted α      0.0071    0.0083    0.0100    0.0125    0.01667   0.0250    0.0500

20%             STL       MTL       CMTL      Whom      FlexClus  Coclus    MeTaG
Test statistic  3.71      2.51      3.82      1.64      3.38      2.62      2.73
p value         0.00021   0.0121    0.0001    0.1017    0.0007    0.0088    0.0064
Adjusted α      0.0083    0.0250    0.0071    0.0500    0.0100    0.01667   0.0125

50%             STL       MTL       CMTL      Whom      FlexClus  Coclus    MeTaG
Test statistic  3.55      2.18      4.04      1.75      3.98      3.27      2.61
p value         0.0004    0.0291    0.0001    0.0809    0.0001    0.0011    0.0089
Adjusted α      0.0100    0.0250    0.0071    0.0500    0.0083    0.0125    0.01667

Also, in Figure 1, we provide better insight into how the grouping of task feature spaces might be determined
in our framework. For the purpose of visualization, we applied two Gaussian kernel functions with spread
parameters $2$ and $2^8$ and used the Letter multi-task data set.

In this figure, the x and y axes represent the weights of these two kernel functions for each task. From
Figure 1 (a), when a small training size (10%) is chosen, it can be seen that our framework yields a cluster of
3 tasks, namely {“A” vs “G”, “A” vs “O”, “G” vs “Y”}, that share a common feature space to benefit from
each other’s data. However, as the number n of training samples per task increases, every task is allowed to
employ its own feature space to guarantee good performance. This is shown in Figure 1 (b), which displays
the results obtained for a 50% training set size. Note that the displayed MKL weights lie on the
$\theta_1 + \theta_2 = 1$ line due to the framework’s $L_1$ MKL weight constraint.
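
To make the role of the two kernel weights concrete, the sketch below builds a task's combined kernel matrix as a convex combination of two Gaussian kernels. It is illustrative only. The parameterization exp(-||x - z||^2 / (2 s^2)), the helper names, the toy samples and the example weights are assumptions, and the spreads 2 and 2^8 follow one reading of the values quoted above.

```python
import math

def gaussian_kernel(x, z, spread):
    """Gaussian kernel value for two feature vectors (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * spread ** 2))

def combined_kernel_matrix(samples, theta, spreads):
    """Return K = sum_m theta[m] * K_m, with theta on the probability simplex."""
    if any(w < 0 for w in theta) or abs(sum(theta) - 1.0) > 1e-12:
        raise ValueError("theta must be non-negative and sum to 1")
    n = len(samples)
    return [[sum(w * gaussian_kernel(samples[i], samples[j], s)
                 for w, s in zip(theta, spreads))
             for j in range(n)]
            for i in range(n)]

# Toy task with three 2-D samples; (0.3, 0.7) is one point on theta_1 + theta_2 = 1.
samples = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
K = combined_kernel_matrix(samples, theta=(0.3, 0.7), spreads=(2.0, 2.0 ** 8))
for row in K:
    print(["%.4f" % value for value in row])
```

Two tasks whose weight vectors coincide use the same combined kernel and hence the same feature space, which is what a cluster of points in Figure 1 represents.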


(a) Training set size 10%

(b) Training set size 50%

Figure 1: Feature space parameters for Letter multi-task data set


6 Conclusions 

In this work, we proposed a novel MT-MKL framework for SVM-based binary classification, where a flexible
group structure is determined between each pair of tasks. In this framework, tasks are allowed to have
common, similar, or distinct feature spaces. Recently, some MTL frameworks have been proposed, which
also consider clustering strategies to capture task relatedness. However, our method is capable of modeling
a more general type of task relationship, where tasks may be implicitly grouped according to a notion of
feature space similarity. Also, our proposed optimization algorithm allows for a distributed implementation,
which can be significantly advantageous for MTL settings involving a large number of tasks. The performance
advantages reported on seven multi-task SVM-based classification problems largely seem to justify our arguments
in favor of our framework.

Supplementary Materials

A useful lemma for deriving the generalization bound of Theorem 1 is provided next.

Lemma 1. Let $A, B \in \mathbb{R}^{N \times N}$ and let $\sigma \in \mathbb{R}^N$ be a vector of independent Rademacher random variables.
Let $\circ$ denote the Hadamard (component-wise) matrix product. Then, it holds that

$$\mathbb{E}_\sigma\{(\sigma' A \sigma)(\sigma' B \sigma)\} = \operatorname{trace}\{A\}\operatorname{trace}\{B\} + 2\left(\operatorname{trace}\{AB\} - \operatorname{trace}\{A \circ B\}\right) \tag{29}$$

Proof. Let $[\cdot]$ denote the Iverson bracket, such that [predicate] = 1 if predicate is true and 0 if false. The
expectation in question can be written as

$$\mathbb{E}_\sigma\{(\sigma' A \sigma)(\sigma' B \sigma)\} = \sum_{i,j,k,l} a_{ij} b_{kl}\, \mathbb{E}\{\sigma_i \sigma_j \sigma_k \sigma_l\} \tag{30}$$

where the indices of the last sum run over the set $\{1, \ldots, N\}$. Since the components of $\sigma$ are independent
Rademacher random variables, it is not difficult to verify that $\mathbb{E}\{\sigma_i \sigma_j \sigma_k \sigma_l\} = 1$ only in the following
four cases: $\{i = k, j = l, i \neq l\}$, $\{i = j, k = l, i \neq k\}$, $\{i = l, k = j, i \neq k\}$ and $\{i = j, j = k, k = l\}$; in all other
cases, $\mathbb{E}\{\sigma_i \sigma_j \sigma_k \sigma_l\} = 0$. Therefore, it holds that

$$\mathbb{E}\{\sigma_i \sigma_j \sigma_k \sigma_l\} = [i = k][j = l][i \neq l] + [i = j][k = l][i \neq k] + [i = l][k = j][i \neq k] + [i = j][j = k][k = l] \tag{31}$$

Substituting (31) into (30), after some algebraic operations, yields the desired result.

□
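
The identity in Lemma 1 can be checked numerically for a small case by enumerating every sign vector. The sketch below is only a sanity check, not part of the proof. The helper names and the two test matrices are made up for the example, and the matrices are taken to be symmetric, as kernel matrices are.

```python
from itertools import product

def quad(M, s):
    """Quadratic form s' M s."""
    n = len(s)
    return sum(s[i] * M[i][j] * s[j] for i in range(n) for j in range(n))

def lhs_expectation(A, B):
    """Exact E{(s'As)(s'Bs)}, averaging over all 2^N Rademacher vectors."""
    n = len(A)
    total = sum(quad(A, s) * quad(B, s) for s in product((-1, 1), repeat=n))
    return total / 2 ** n

def rhs_closed_form(A, B):
    """trace{A} trace{B} + 2 (trace{AB} - trace{A o B})."""
    n = len(A)
    tr_a = sum(A[i][i] for i in range(n))
    tr_b = sum(B[i][i] for i in range(n))
    tr_ab = sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))
    tr_hadamard = sum(A[i][i] * B[i][i] for i in range(n))
    return tr_a * tr_b + 2 * (tr_ab - tr_hadamard)

# Two symmetric 3 x 3 integer matrices chosen arbitrarily for the check.
A = [[2, 1, 0], [1, 3, -1], [0, -1, 1]]
B = [[1, 0, 2], [0, 4, 1], [2, 1, 5]]
print(lhs_expectation(A, B), rhs_closed_form(A, B))
```

The enumeration costs 2^N evaluations, so it is practical only for small N.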


Proof of Theorem 1 

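By utilizing Theorems 16 and 17 in [24], it can be proved that, given a multi-task HS $\mathcal{F}$, defined as a class of
functions $f = (f_1, \ldots, f_T) : \mathcal{X} \mapsto \mathbb{R}^T$, for all $f \in \mathcal{F}$, for $\delta > 0$ and for fixed $\rho > 0$, with probability at least
$1 - \delta$ the following holds

$$R(f) \le \hat{R}_\rho(f) + \frac{2}{\rho}\,\hat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2Tn}} \tag{32}$$

where the ERC $\hat{\mathfrak{R}}_S(\mathcal{F})$ is given as

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{1}{nT}\,\mathbb{E}_\sigma\left\{\sup_{f = (f_1, \ldots, f_T) \in \mathcal{F}} \sum_{t=1}^{T}\sum_{i=1}^{n} \sigma_t^i f_t(x_t^i)\right\} \tag{33}$$

and the $\rho$-empirical large margin error $\hat{R}_\rho(f)$ for the training sample $S = \{(x_t^i, y_t^i)\}_{i,t=1}^{n,T}$ is defined as

$$\hat{R}_\rho(f) = \frac{1}{nT}\sum_{t=1}^{T}\sum_{i=1}^{n} \min\left(1, \left[1 - y_t^i f_t(x_t^i)/\rho\right]_+\right)$$

Also, from eqs. (1) and (2) in [9], we know that $w_t = \sum_{i=1}^{n} \alpha_t^i \phi_t(x_t^i)$, along with the constraint $\|w_t\|^2 \le R_t$,
is equivalent to $\alpha_t' K_t \alpha_t \le R_t$. Then we can observe that $\forall x \in S$ and $t \in 1, \ldots, T$, the decision function
defined as $f_t(x_t) = \langle w_t, \phi_t(x_t) \rangle_{\mathcal{H}_{t,\theta}}$ is equivalent to $f_t(x_t) = \sum_{i=1}^{n} \alpha_t^i K_t(x_t^i, x_t)$, where $K_t = \sum_{m=1}^{M} \theta_t^m K_t^m$.
So, based on the definition of empirical Rademacher complexity given in (33), we will have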


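$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{1}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta),\, \alpha_t \in \Omega(\alpha)} \sum_{t=1}^{T}\sum_{i,j=1}^{n,n} \sigma_t^i \alpha_t^j K_t(x_t^i, x_t^j)\right\} = \frac{1}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta),\, \alpha_t \in \Omega(\alpha)} \sum_{t=1}^{T} \sigma_t' K_t \alpha_t\right\} \tag{34}$$

where $\sigma_t = [\sigma_t^1, \ldots, \sigma_t^n]'$, $\alpha_t = [\alpha_t^1, \ldots, \alpha_t^n]'$, $K_t \in \mathbb{R}^{n \times n}$ is a kernel matrix whose $(i,j)$-th element is defined
as $\sum_{m=1}^{M} \theta_t^m K_m(x_t^i, x_t^j)$, $\Omega(\alpha) = \{\alpha_t \mid \alpha_t' K_t \alpha_t \le R_t, \forall t\}$ and $\Omega(\theta)$ is defined as in (25).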

It can be observed that the maximization problem with respect to $\alpha_t$ can be handled as $T$ independent
optimization problems, as $\Omega(\alpha)$ is separable in terms of $\alpha_t$. Also, it can be shown using the Cauchy-Schwarz
inequality that the optimal value of $\alpha_t$ is achieved when $K_t^{1/2} \alpha_t$ is collinear with $K_t^{1/2} \sigma_t$, which gives

$$\sup_{\alpha_t \in \Omega(\alpha)} \sigma_t' K_t \alpha_t = \sqrt{\sigma_t' K_t \sigma_t R_t}$$
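
For completeness, the step behind this supremum can be spelled out as follows. This is the standard argument, written under the assumption that $K_t$ is positive semi-definite, so that $K_t^{1/2}$ exists, and that $\alpha_t' K_t \alpha_t \le R_t$:

```latex
\sigma_t' K_t \alpha_t
  = \big(K_t^{1/2}\sigma_t\big)'\big(K_t^{1/2}\alpha_t\big)
  \le \big\|K_t^{1/2}\sigma_t\big\|_2 \, \big\|K_t^{1/2}\alpha_t\big\|_2
  = \sqrt{\sigma_t' K_t \sigma_t}\,\sqrt{\alpha_t' K_t \alpha_t}
  \le \sqrt{\sigma_t' K_t \sigma_t \, R_t}
```

Both inequalities hold with equality for $\alpha_t = \sqrt{R_t / (\sigma_t' K_t \sigma_t)}\,\sigma_t$ whenever $\sigma_t' K_t \sigma_t > 0$, which is the collinearity condition mentioned above.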


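Assuming $R_t \le R$ $\forall t$, (34) now becomes

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \frac{\sqrt{R}}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sum_{t=1}^{T} \sqrt{\sigma_t' K_t \sigma_t}\right\} = \frac{\sqrt{R}}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sum_{t=1}^{T} \sqrt{\sum_{m=1}^{M} \theta_t^m \sigma_t' K_t^m \sigma_t}\right\} = \frac{\sqrt{R}}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sum_{t=1}^{T} \sqrt{\theta_t' u_t}\right\} \tag{35}$$

where $\theta_t = [\theta_t^1, \ldots, \theta_t^M]'$, $u_t = [u_t^1, \ldots, u_t^M]'$, and $u_t^m = \sigma_t' K_t^m \sigma_t$. Note that (35) can also be
upper-bounded. In particular, letting $v_t = \sqrt{\theta_t' u_t}$ and $v = [v_1, \ldots, v_T]'$, and using Hölder’s inequality
$\|xy\|_r \le \|x\|_p \|y\|_q$ for $p = q = 2$, $r = 1$ and $y = \mathbf{1}_T$, we will have

$$\sum_{t=1}^{T} v_t = \|v\|_1 \le \sqrt{T}\left(\sum_{t=1}^{T} v_t^2\right)^{1/2} = \sqrt{T}\left(\sum_{t=1}^{T} \theta_t' u_t\right)^{1/2}$$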

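Therefore, we can upper bound the Rademacher complexity (35) as follows

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \frac{\sqrt{R}}{nT}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sum_{t=1}^{T} \sqrt{\theta_t' u_t}\right\} \le \frac{1}{n}\sqrt{\frac{R}{T}}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sqrt{\sum_{t=1}^{T} \theta_t' u_t}\right\} = \frac{1}{n}\sqrt{\frac{R}{T}}\,\mathbb{E}_\sigma\left\{\sup_{\theta_t \in \Omega(\theta)} \sqrt{\operatorname{trace}\{\Theta' U\}}\right\} \tag{36}$$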

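where $\Theta = [\theta_1, \ldots, \theta_T] \in \mathbb{R}^{M \times T}$ and $U = [u_1, \ldots, u_T] \in \mathbb{R}^{M \times T}$. Also, by contradiction, it can be easily
proved that

$$\arg\max_{\Theta} \operatorname{trace}\{\Theta' U\} = \arg\max_{\Theta} \sqrt{\operatorname{trace}\{\Theta' U\}}$$

Using the Lagrangian multiplier method, the optimization w.r.t. $\Theta$ yields the optimal value for $\Theta$ as

$$\Theta^* = \frac{1}{2\alpha}\, U P_T$$

where $P_T \in \mathbb{R}^{T \times T}$ is a centering matrix as we defined in Sect. 3. Moreover, $\alpha = \frac{1}{2\gamma}\sqrt{a - \frac{1}{T} b}$,
$a = \operatorname{trace}\{U U'\}$, and $b = \operatorname{trace}\{U \mathbf{1}_T \mathbf{1}_T' U'\}$.

Substituting the optimal value of $\Theta$ in (36) finally yields

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \frac{1}{n}\sqrt{\frac{\gamma R}{T}}\;\mathbb{E}_\sigma\left\{\left[\sqrt{a - \tfrac{1}{T}\, b}\right]^{1/2}\right\}$$

By applying Jensen’s inequality twice, we obtain

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \frac{1}{n}\sqrt{\frac{\gamma R}{T}}\left\{\sqrt{\mathbb{E}_\sigma\left(a - \tfrac{1}{T}\, b\right)}\right\}^{1/2} \tag{37}$$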


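From the definition, we can see that both $a$ and $b$ depend on the variable $\sigma$. If we define $u^m = [\sigma_1' K_1^m \sigma_1, \ldots, \sigma_T' K_T^m \sigma_T]$
as the $m$-th row vector of matrix $U$, and $d = \left[\operatorname{trace}\{\sum_{m=1}^{M} K_1^m\}, \ldots, \operatorname{trace}\{\sum_{m=1}^{M} K_T^m\}\right]'$, then with the help of
Lemma 1, it can be shown that

$$\mathbb{E}_\sigma(a) = d' d + 2\sum_{t=1}^{T}\sum_{m=1}^{M}\left[\operatorname{trace}\{K_t^m K_t^m\} - \operatorname{trace}\{K_t^m \circ K_t^m\}\right]$$

$$\mathbb{E}_\sigma(b) = d' \mathbf{1}_T \mathbf{1}_T' d + 2\sum_{t=1}^{T}\sum_{m=1}^{M}\left[\operatorname{trace}\{K_t^m K_t^m\} - \operatorname{trace}\{K_t^m \circ K_t^m\}\right] \tag{38}$$

Considering the fact that $\operatorname{trace}\{K_t^m \circ K_t^n\} \ge 0$ and $\operatorname{trace}\{K_t^m K_t^n\} \ge 0$ $\forall t, m, n$, and assuming that
$K_t^m(x, x) \le 1$ $\forall x, t, m$, it can be shown that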

$$\mathbb{E}_\sigma(a) - \frac{1}{T}\,\mathbb{E}_\sigma(b) \le T M^2 n^2 \left\{1 + \cdots\right\} \le 3 T M^2 n^2 \tag{39}$$

Combining (37) and (39), after some algebraic operations, we conclude that, if

Ruh m 4 ( 40 ) 

then $\hat{\mathfrak{R}}_S(\mathcal{F}) \le \hat{\mathfrak{R}}_{ub}(\mathcal{F})$. This last fact, in conjunction with (32), concludes the theorem’s statement.
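
As a hedged consistency check on how (37) and (39) combine, suppose the constant multiplying the expectation in (37) is $\frac{1}{n}\sqrt{\gamma R / T}$. This constant is an assumption: it is what (36) gives if the supremum of $\operatorname{trace}\{\Theta' U\}$ equals $\gamma\sqrt{a - b/T}$, and it should be checked against the definition of $\Omega(\theta)$ in (25). Under that assumption, and writing the double square root in (37) as a fourth root, the chain reads:

```latex
\hat{\mathfrak{R}}_S(\mathcal{F})
  \le \frac{1}{n}\sqrt{\frac{\gamma R}{T}}
      \left( \mathbb{E}_\sigma(a) - \tfrac{1}{T}\,\mathbb{E}_\sigma(b) \right)^{1/4}
  \le \frac{1}{n}\sqrt{\frac{\gamma R}{T}} \left( 3\, T M^2 n^2 \right)^{1/4}
  = \sqrt{\frac{\gamma R M}{n}} \; \sqrt[4]{\frac{3}{T}}
```

Under this reading, the complexity term decays as $n^{-1/2}$ in the number of samples per task and as $T^{-1/4}$ in the number of tasks. The exact constant in (40) may differ.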













